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The solution of large, sparse constrained least-squares problems is a staple in scientific 
and engineering applications. However, currently available codes for such problems are 
proprietary or based on MAT LAB. We announce a freely available C implementation of the 
fast block pivoting algorithm of Portugal, Judice, and Vicente. Our version is several times 
faster than Matstoms' MAT LAB implementation of the same algorithm. Further, our code 
matches the accuracy of MATLAB's built-in 1 sqn on neg function. 
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1. INTRODUCTION 

The authors were recently faced with the challenge of finding a fast solver for the sparse non- 
negative least-squares problem (NNLS) to embed in a much larger scientific application. The 
problem is given by 

min-||v4x - 6II2 with x > (1) 
X 2 

where A is an m x n matrix, x G M."', b E M"' and m > n, and we assume A has full col- 
umn rank. This is a standard problem in numerical linear algebra (lUl 0]) which is handled by 
a number of commercial libraries |^ [12]) and by the MATLAB-based Sparse Matrix Tool- 
box of [5]. While these methods work well, their users must incur the overhead of a large math 
package or the expense and license restrictions of commercial libraries. There does not seem 
to be a freely available solver for this problem without these disadvantages. This motivated the 
development of tsnnls, a lightweight ANSI C implementation of the block principal pivoting 
algorithm of [7] which matches the accuracy of the MATLAB-based codes and is considerably 
faster. The code can be obtained at http : / /www . cs . dug . edu/ ~ pi at ek/ tsnnls/ or 
http : / / ada . math . uga . edu/ res earch/ sof tware/ tsnnls /( Users may redistribute 
the library under the terms of the GNU GPL. 



*Email: cantarel@math.uga.edu| 
^^Email: piatek@mathcs.duq.edu| 
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2. ALGORITHMS 

The following is a summary of our main algorithm as described in [7]. The fundamental obser- 
vation underlying the block principal pivoting algorithm is that Equation[T]can be rewritten (using 
the definition of the Li^ norm) as a quadratic program: 

min -{A^h^x + -x^Ai^Ax, with x > 0. (2) 

X 2 

Since A has full rank, A^A is positive-definite, and this is a convex program which can be rewritten 
as a linear complementarity problem: 

y = A^Ax - A'^b, y>0,x>0, {x, y) = 0. (3) 

The last condition means that the nonzero entries of x and y occupy complementary variables: any 
given position must vanish in x or y (or both). In fact, the nonzero entries in y represent variables 
in X which would decrease the residual Ax — b still further by becoming negative, and so are set 
to zero in the solution to the constrained problem. 

Suppose we have a division of the n indices of the variables in x into complementary sets F 
and G, and let xp and yc denote pairs of vectors with the indices of their nonzero entries in these 
sets. Then we say that the pair {xp, yc) is a complementary basic solution of EquationElif xp is 
a solution of the unconstrained least squares problem 

imn ]-\\AFXF-b\\l (4) 

xp<m.\''\ I 

where Ap\% formed from A by selecting the columns indexed by F, and yc is obtained by 

yc = Al {Apxp - h) . (5) 

\f Xp > and yc > 0, then the solution h feasible. Otherwise it is infeasible, and we refer to 
the negative entries of and yc as infeasible variables. The idea of the algorithm is to pro- 
ceed through infeasible complementary basic solutions of © to the unique feasible solution by 
exchanging infeasible variables between F and G and updating xp and yc by © and To 
minimize the number of solutions of the least-squares problem in ©, it is desirable to exchange 
variables in large groups if possible. In rare cases, this may cause the algorithm to cycle. There- 
fore, we fall back on exchanging variables one at a time if no progress is made for a certain number 
of iterations with the larger exchanges. 

The original block-principal pivoting algorithm works very well for what we call "numerically 
nondegenerate" problems, where each of the variables in F and G have values distinguishable 
from zero by the unconstrained solver in the feasible solution. If this is not the case, a variable with 
solution value close to zero may be passed back and forth between F and G, each time reported as 
slightly negative due to error in the unconstrained solver. We work around this problem by zeroing 
variables in the unconstrained solution that are within 10^^^ of zero. Although this strategy works 
well in practice, we have not developed its theoretical basis. Indeed, this seems to be an unexplored 
area: [7] do not discuss the issue in their original development of the algorithm and Matstoms' 
snnls implementation fails in this case. 

The details are summarized below. 
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Block principal pivoting algorithm (modified for numerically degenerate problems) 

Let F = 0, G = {1, . . . , n}, a; = 0, ?/ = -A^b, and p = 3. 
Set = oo. 

while {xp, Ug) is an infeasible solution { 

Set n to the number of negative entries inxp and uq. 
itn < N (the number of infeasibles has decreased) { 
Set N = n and p = 3. 

Exchange all infeasible variables between F and G. 
} else { 
if p > { 
Set p = p — 1. 

Exchange all infeasible variables between F and G. 
} else { 

Exchange only the infeasible variable with largest index. 

} 

} 

Update xp and yc by Equations HI and |5] 

Set variables in xp < 10^^^ and ye < 10~^^ to zero. 

} 

The normal equations solver. 

Solving EquationlUrequires an unconstrained least-squares solver. We will often be able to do 
this by the method of normal equations. Since some of our software design choices depend on the 
details of this standard method, we review them here. To solve a least-squares problem Ax = b 
using the normal equations, one solves 

A^Ax = A^b (6) 

using a Cholesky factorization of the symmetric matrix A^A. This is extremely fast. For an m x n 
dense matrix A, the matrix multiplication required to form A^A requires n^m flops, which is more 
expensive than the standard Cholesky algorithm which is known to take ^n^ + O(n^) flops. For 
our sparse matrix problems, we found a comparable relationship between the time required for a 
sparse matrix-multiply and the TAUCS sparse Cholesky algorithm. 

The numerical performance of this method can be a problem. The condition number of A^A 
is the square of the condition number k of A. For this reason, we must expect a relative error of 
about c/c^e, where e is the machine epsilon (~10~^^ in our double-precision code), and c is not 
large. As [3] points out, the Cholesky decomposition may fail entirely when K^e > 1, so we cannot 
expect this method to handle matrices with k > 10®. Our tests indicate that this simple analysis 
predicts the error in the normal equations solver very well (see Section|4|), so we can anticipate the 
accuracy of the solver by estimating the condition number of A^A. 

3. SOFTWARE ARCHITECTURE 

Our primary design goal in the development of t snnl s was to create the most efficient solver 
which met the user's accuracy requirements and did not depend on commercial software or re- 
stricted libraries. It is clear that the heart of the algorithm is the solution of the least-squares 
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problem in Equation|4lfor the new xp. But the way these solutions are used is quite interesting. In 
the intermediate stages of the calculation, we only use xp and yc to search for infeasible variables 
to shift between F and G. So we need only calculate correct signs for all the variables — beyond 
this the numerical quality of these solutions is unimportant. But the last solution of Equations HI 
and 121 is the result of the algorithm, so this solution must meet the user's full accuracy needs. 
Our implementation takes advantage of this situation by using the method of normal equations for 
the intermediate solutions of Equation |4| and then recomputing the final solution using the more 
accurate LSQR solver of |6]. 

The method of normal equations is already fast. But two of our implementation ideas improve 
its speed still further in our solver. As we mentioned in Section |2l computing A^A is the most 
expensive step in the normal equations solver. A first observation is that we need not form A'^ 
explicitly in order to perform this matrix multiplication, since A^Aij is just the dot product of 
the i^^ and columns of A. This provides some speedup. More importantly, we observe that 
each least-squares problem in t snnls is based on a submatrix Ap of the same matrix A. Since 
A^p is a submatrix of A^A, we can precompute the full A^A and pass submatrices to the normal 
equations solver as required. This is a significant speed increase. We make use of the TAUCS 
library of |8] for highly optimized computation of the sparse Cholesky factorization needed for 
the method of normal equations. 

We can estimate the relative error K^e of each normal equations solution by computing the con- 
dition number of A^p with the LAPACK function dpocon. Since we have already computed 
the Cholesky factorization of A^A as part of the solution, this takes very little additional time in 
the computation. This is used to determine when a switch to a final step with LSQR is necessary 
for error control. 

In order to simplify its' use in other applications, our library incorporates simplified forms of 
the TAUCS and LSQR distributions. These are compiled directly into our library, so there is no 
need for the user to obtain and link with these codes separately. 

4. SOFTWARE TESTING 

We tested our implementation using problems produced by the LSQR test generator which gen- 
erates arbitrarily sized matrices with specified condition number and solution (see |6] for details 
on how the generator works). We report on the relative error of our method with the problems 
of type P(80, 70, 4, x), which were typical of our test results. Here 80 and 70 are the dimensions 
of the matrix, each singular value is repeated 4 times, and x is a parameter which controls the 
condition number of the problem. For these matrices the exact solution was known in advance, so 
we could measure the relative error of our solutions as a function of condition number. 

The results of this test are shown in Figure [H The line of datapoints indicated by x shows the 
error in t snnls using only our normal equations solver. As expected, it fits very well to about 
jqk'^s where k is the condition number of the matrix and e is machine epsilon. The second set 
of data points (denoted by >) shows that we usually improve our relative error by 2 or 3 orders of 
magnitude by recomputing the final solution with LSQR. The third set of data points (denoted by o) 
plots the error from the MATLAB function 1 sqnonneg on these problems. For condition numbers 
up to 10^, we see that t snnls and 1 sqnonneg have comparable accuracy. But surprisingly, our 
method seems to be more stable than Isqnonneg for very ill-conditioned problems. 

We also tested the performance of our software against that of Isqnonneg and that of the 
snnls code of 0] . All of our timing tests were performed on a dual 2.0 GHz Power Macintosh G5 
running Mac OS X 10.3, compiling with gcc 3 . 3 and -03, and linking with Apple's optimized 
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FIG. 1 : This plot shows the relative error in the solution of a selection of 80 x 70 test problems generated 
by the LSQR test generator with inputs P(80, 70,4, x) and condition numbers varying from 10^ to 10^. 
The logarithm (base 10) of this error e is plotted against the logarithm of condition number for three codes: 
t s n n 1 s restricted to use only the normal equations solver, the final version of t s n n 1 s , which recomputes 
the final solution with LSQR, and the MATLAB function Isqnonneg. 

versions of LAPACK and BLAS. We ran snnls under MATLAB 7 with argument -no jvm. 

We were required to make two modifications to the snnls code to complete our tests. First, 
the snnls code uses the column minimum degree permutation (colmmd) before performing 
sparse Cholesky decompositions. However, as this ordering is deprecated in MATLAB 7 in favor 
of absolute minimum degree ordering, we tested against a modified snnls using colamd. This 
was a strict performance improvement for our test cases. We also made the same workaround to 
handle degenerate problems that we discussed for t snnls in Sectional 

Our performance results are shown in Figure El We tested runtimes for randomly generated, 
well-conditioned matrices from MATLAB's sprandn function. The matrices were of size n x 
{n — 10). The plot shows runtime results for a set of density 1 matrices and a set of density 0.01 
matrices, intended to represent general dense and sparse matrices. Each data point represents the 
average runtime for 10 different matrices of the same size and density. 

We can see that the runtime of our implementation is approximately proportional to that of 
snnls, and that for dense problems it is several times faster. We were surprised to note that the 
constant of proportionality decreases for sparse matrices and that our method is almost 10 times 
faster than snnls for matrices of density 0.01. 

The runtime of each code is controlled by three computations: the matrix-multiply used to 
form Ai^A, the Cholesky decomposition of that matrix, and the final recalculation of the solution 
(if performed). We expected to be several times faster than snnls since our caching strategy 
for A^A eliminates a matrix-multiply operation for each pivot. The number of pivots, however, 
does not seem to vary with the density of our random test matrices and so does not explain our 
additional speed increase for sparse problems. 

We explored this phenomenon by profiling both our code and snnls. For our random test 
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FIG. 2: These log-log scaled plots show the runtime of tsnnls and snnls on density 1.0 (left) and 0.01 
(right) matrices of size n x {n — 10) on a 2.0 Ghz Apple PowerMac G5. We can see that the runtime of 
tsnnls is basically proportional to that of snnls, but that the constant of proportionality depends on the 
density of the test matrices. This effect is explained below. All runtimes were calculated by repeating the 
test problems until the total time measured was several seconds or more. 



problems at density 0.01, the final unconstrained solution in snnls (computed using the the 
MATLAB \ operation) consumes almost 50% of the total runtime. On the other hand, in tsnnls 
the final unconstrained solution (using LSQR) consumes only 5% of runtime. Since the Cholesky 
decompositions take comparable time, tliis would seem to explain the runtime disparity. 

We did not show performance data for MAT LAB ' s built-in Isqnonneg because it was so much 
slower than both tsnnls and snnls. For sparse matrices, this is in part because Isqnonneg 
is a dense-matrix code. Yet, even on dense matrices, both methods outperformed Isqnonneg by 
an overwhelming amount. For instance, for a 500 x 490 dense matrix, Isqrnonneg takes over 
100 seconds to complete while snnls and tsnnls both finish in less than one second. We take 
this as a confirmation of the suggestion in MathWorks' documentation of Isqnonneg that it is 
not appropriate for large problems. 
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